In order to get the data for this tutorial, you’ll need to get an API key to use the Census API service here. Once you have it, assign the key to the variable api_key.
api_key <- "put api key here"
Now that you have the API key, run the following code that is available on the Storytelling repository on Github. This will automatically pull in code that and assemble the dataset for the visualizations in this section.
source("https://raw.githubusercontent.com/CommerceDataService/cda_storytelling_in_r/gh-pages/get_data.R")
Sometimes two dimensional visuals are not enough. There is a lot more to the data that can be used to contextualize latent patterns. Often times, many analysts tend to think in two-dimensions – like scatter plots. But there’s more to it. In the dataset that you’ve just imported, it has the following characteristics:
summary(data)
## GEOID state geography region
## Length:3142 Min. : 1.00 Length:3142 Min. :1.000
## Class :character 1st Qu.:18.00 Class :character 1st Qu.:2.000
## Mode :character Median :29.00 Mode :character Median :3.000
## Mean :30.28 Mean :2.669
## 3rd Qu.:45.00 3rd Qu.:3.000
## Max. :56.00 Max. :4.000
##
## name pct_poverty pct_unemp pct_hs_grad
## South Region :1422 Min. : 0.00 Min. : 0.00 Min. :46.70
## Midwest Region :1055 1st Qu.: 8.30 1st Qu.: 3.70 1st Qu.:80.70
## West Region : 448 Median :11.50 Median : 4.90 Median :86.40
## Northeast Region: 217 Mean :12.35 Mean : 4.94 Mean :84.98
## Alabama : 0 3rd Qu.:15.30 3rd Qu.: 6.10 3rd Qu.:90.10
## Alaska : 0 Max. :45.50 Max. :20.90 Max. :98.70
## (Other) : 0
Let’s say we were provided a nice clean set of data that contains the following:
What can you do with that data? Well, turns out that that these quantities are related.
How did we get to this?
#Load in Threejs library
library(threejs)
We can see that there are direct relationships between unemployment, poverty and education attainment. But there isn’t much detail and the graphs aren’t pretty.
scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad)
Let’s stylize the plots. First let’s name the axes with axisLabels, which accepts a vector of axis names. The order matters and is as follows: x-axis, z-axis, y-axis
#Note that axis Labels should follow this order= c(x, z, y)
scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,
axisLabels=c("unemployment","hs degree or above","poverty rate"))
Now let’s change the rendering engine to give more depth to the plot. We do so by changing renderer = “canvas”. This just tells R threejs to use a different package to render the points
#Depth using render
scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,
axisLabels=c("unemployment","hs degree or above","poverty rate"),
renderer="canvas")
Now, let’s set the color of the points, resize the points, and flip the y axis so it’s ascending from the origin. To do so, we: - set col = “slategrey” - set flip.y = FALSE - set size = 0.5
#Point size, color, don't flip y axis
scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,
axisLabels=c("unemployment","hs degree or above","poverty rate"),
renderer="canvas", flip.y=FALSE, col="slategrey",
size=0.5)
Ultimately, we want to find more patterns. By using color, we can group regions by color. We can see some regions are worse off than others. But which? Turns out there are 4 regions:
unique(data$region_name)
## NULL
unique(data$region)
## [1] 1 2 3 4
First, let’s set each region to a different color by first creating a new variable for colors data$colors, then assign a hexcode to each region.
#Set up colors by
data$colors <- ""
data$colors[data$region==1] <- "#011efe0"
data$colors[data$region==2] <- "#0bff01"
data$colors[data$region==3] <- "#fe00f6"
data$colors[data$region==4] <- "#fdfe02"
Now, let’s set col= data$colors so that R knows which color corresponds to each of the 3000 points.
data <- data[order(data$region),]
#Grouped patterns
scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,
axisLabels=c("unemployment","hs degree or above","poverty rate"),
col=data$colors, flip.y=FALSE,
renderer="canvas",
size=0.5)
It’s a bit annoying to look at the chart without knowing which point corresponds to which county. Let’s add labels for each point that show up upon mousing over.
#add labels to points
scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,
axisLabels=c("unemployment","hs degree or above","poverty rate"),
col=data$colors,
labels=paste(data$region_name,": ",data$geography),
size=0.5,
renderer="canvas")
In short, we can tell the following key insights from this graph.
Sometimes graphs don’t get the point across. Maps, while over used, can provide some better indication of patterns.
Based on our 3-d graphs, we could see clustering of regions’s economic performance. We can see the mess of points more clearly on a map.
## OGR data source with driver: ESRI Shapefile
## Source: "cb_2014_us_county_20m.shp", layer: "cb_2014_us_county_20m"
## with 3220 features
## It has 9 fields
We can use the leaflet library to bring a geographic spin to the data. To initiate a map, we only need to open the leaflet library, then run the following:
library(leaflet)
leaflet()
You’ll see that the map is blank with a zoom control panel on the upper left. That’s because the map doesn’t have data in it. There are dozens on free layers we can use:
leaflet() %>%
addProviderTiles("Stamen.Toner")
leaflet() %>%
addProviderTiles("CartoDB.Positron")
Now let’s center and zoom in on the contiguous US at coordinates lon = -98.3 and lat = 39.5
leaflet() %>%
addProviderTiles("CartoDB.Positron") %>%
setView(lng = -98.3, lat = 39.5, zoom = 4)
We now need to download the shapefiles, which are a popular geospatial vector data format for geographic information system (GIS) software. Shapefiles allow for rendering of various types of data, including points (e.g. coordinates), polygons (e.g. county boundaries), and lines (e.g. streets, creeks). We’re going to use the US County Shapefile from the US Census: http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_county_20m.zip. To download it and load it into R, we’ll need to first install the rgdal library. We’ve also written a function shape_direct that can be run to import the shapefile and assign it to an object shp.
shape_direct <- function(url, shp) {
library(rgdal)
temp = tempfile()
download.file(url, temp) ##download the URL taret to the temp file
unzip(temp,exdir=getwd()) ##unzip that file
return(readOGR(paste(shp,".shp",sep=""),shp))
}
shp <- shape_direct(url="http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_county_20m.zip",
shp= "cb_2014_us_county_20m")
## OGR data source with driver: ESRI Shapefile
## Source: "cb_2014_us_county_20m.shp", layer: "cb_2014_us_county_20m"
## with 3220 features
## It has 9 fields
## Warning in readOGR(paste(shp, ".shp", sep = ""), shp): Z-dimension
## discarded
With the new shapefile now imported, we can now set data = shp.
leaflet(data=shp) %>%
addProviderTiles("CartoDB.Positron") %>%
setView(lng = -98.3, lat = 39.5, zoom = 4) %>%
addPolygons(fillColor = "blue",
fillOpacity = 0.8,
color = "white",
weight = 0.5)
In order to draw insight from a map, we’ll need to color code county polygons. This is known as a choropleth map – each county is color coded with respect to a certain value of a given variable like %poverty. The shapefile on its own doesn’t have the socioeconomic data and we’ll need to join the data to the shapefile. Let’s just quickly check the data formats of the primary key GEOID.
str(shp@data$GEOID)
## Factor w/ 3220 levels "01001","01003",..: 560 741 2222 2660 2471 2200 2392 2707 770 853 ...
str(data$GEOID)
## chr [1:3142] "09001" "09003" "09005" "09007" "09009" ...
Since the shp primary key is in a factor format and the data primary key is in string or character format, we’ll need to conform the formats, preferrably to strings. Then, we can merge the two datasets.
shp@data$GEOID <- as.character(shp@data$GEOID)
shp <- merge(shp,data,id="GEOID")
With the merged datasets, we’ll now need to specify a color scheme. Using colorQuantile, we can create a create a function that will slice any continuous variable into bins and assign colors to each bin. The syntax is as follows:
pal <- colorQuantile(<Color Code>, <variable>, n = <number of bins>)
For our example, we’ll use a Yellow-Green palette, leave
palette <- colorQuantile("YlGn", NULL, n = 30)
Next we will add popup text for when a user clicks on a county in a map.
county_popup <- paste0("<strong>County: </strong>",
shp@data$geography,
"<br><strong>Poverty Rate (%): </strong>",
shp@data$pct_poverty)
Now we’ll pull it all together.
leaflet(data = shp) %>%
addProviderTiles("CartoDB.Positron") %>%
setView(lng = -98.3, lat = 39.5, zoom = 4) %>%
addPolygons(fillColor = ~palette(pct_poverty),
fillOpacity = 0.8,
color = "#BDBDC3",
weight = 0.1,
popup = county_popup)
We can run the same graph for pct_unemp by swapping out pct_poverty
county_popup <- paste0("<strong>County: </strong>",
shp@data$geography,
"<br><strong>Unemp (%): </strong>",
shp@data$pct_unemp)
leaflet(data = shp) %>%
addProviderTiles("CartoDB.Positron") %>%
setView(lng = -98.3, lat = 39.5, zoom = 4) %>%
addPolygons(fillColor = ~palette(pct_unemp),
fillOpacity = 0.8,
color = "#BDBDC3",
weight = 0.1,
popup = county_popup)
```